Tiago B.N. Silveira, MSc.
July 15, 2018
This project is part of the Udacity Data Analysis Nanodegree and aims to apply exploratory data analysis (EDA) techniques using R in order to find out relationships in one to multiple variables, as well as exploring a given dataset for distributions, outliers, and anomalies.
EDA can thus be understood as the numerical and graphical examination of data characteristics and relationships before formal, rigorous statistical analyses are applied. EDA can lead to insights, which may uncover to other questions, and eventually predictive models. As explained during the Udacity course, EDA is also an important “line of defense” against bad data and is an opportunity to notice that some assumptions or intuitions about a dataset are violated.
The first step before diving down EDA is to provide a short introduction into the dataset we are planning to explore. In this case, from a list of available options I have chosen the “Wine Quality” dataset (available at http://dx.doi.org/10.1016/j.dss.2009.05.016) since I’m a wine lover.
Said that, and since our data is already loaded, the first thing to do is to take a look at the very first line of the dataset:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
We already knew it is a tidy dataset. Anyway, it was possible to check there is any missing value (NA values, for example) through the summary output.
To understand each data attributes, it’s helpful to consult the data description where the following information is found:
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [1]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc).
It’s also important to notice that besides the target feature is the wine quality, the dataset is unbalanced regarding to it. It means there are few excellent or poor wines. This issue will be addressed later. For now, let’s start taking a look at each attribute in order to understand our data in deepen way.
In order to catch the meaning of each variable in the dataset, in this section we will seek to explore each attribute in what is known as univariate analysis. The most relevant attributes of this dataset are consolidated at the end of this section.
Since it’s the target attribute, there is any doubt about its relevance. That way, let us first understand its distribution through a bar chart, since quality is an ordinal data.
The first thing to notice is that there are few really bad wines, i.e. those whose quality is below 3. In the same way, there are few really good wines, ie, those whose quality is above 8. It’s also effort to notice this distribution seems like a normal one.
Considering this is the target attribute, all the upcoming bivariate analysis will be in relation to this attribute, in the next section. By now, let’s take a look on the distribution of the other dataset variables, together with a boxplot in order to show out the outliers.
This attribute refers to the percent alcohol content of the wine (% of volume).
We can notice from the boxplot there are few outliers in this attribute distribution, i.e. most of the samples stands inside \(1.5\) times the interquartile rate (IQR). From the histogram, the cartesian coordinates are adequate to understand this variable distribution, which is concentrated on 9% values.
Accordingly to the dataset documentation, it refers to a wine additive which can contribute to sulfur dioxide gas (\(S0_2\)) levels. This last acts as an antimicrobial and antioxidant. The sulphates levels are measured by the concentration of potassium sulphate in g/dm³.
To find out the range of sulphates in the evaluated wines, we can just summarize this attribute, as follow:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0,3300 0,5500 0,6200 0,6581 0,7300 2,0000
As we did before, the next graphs show how this attribute is distributed along our data:
The values of this attribute are distributed around the mean showing also some skewness, which is justified by the high number of outliers, as shown in the boxplot.
This attribute describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Again, our first step will be checking the values for this attribute using the summary command, whose values around 3.31 confirm the above statement.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2,740 3,210 3,310 3,311 3,400 4,010
Trying to see this same information through the histogram, in order to confirm visually if it follows a normal distribution:
As shown above by the plots, the wine pH follows a quasi-normal distribution. We can also confirm that it varies in the range of 3.0 to 3.6, also with some outliers beyond this range.
The wine densities vary accordingly to density of the water and on the percent alcohol and sugar content. I particularly prefer full-bodied wines to those lighter ones, characteristics I immediately associate to density. But as far as I know, there are both strong and soft great wines, in the way I don’t think density would be a decision attribute for wine quality. I will check this assumptions in the bivariate analysis. For now, let’s see how this attribute is distributed in our data:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0,9901 0,9956 0,9968 0,9967 0,9978 1,0037
From the summary information, the variance is about 0.01 or 1%. It means a variation of just 1g for each 100ml of wine. I wonder if humans are able to perceive such small variance, which has lead me to an interesting reference [2] where it is said that the term “body” or “weight” of a wine, besides is almost always a desirable aspect, is a wine property not fully defined. In the same reference, what called my attention is that sweetness and the presence of glycerol can increase the perception of body, while acidity appears to reduce the perception of body. This description has clarified my confusion around wine density, in the way we now follow with the analysis:
As explained in the dataset description, the sulfur dioxide \(SO_2\) can be found in its free-form, as a dissolved gas or as a bisulfite ion, or in bounded forms with other molecules. Besides its function to prevent microbial growth and wine oxidation, it is mostly undetectable in wine. Free concentrations over 50 ppm, though, turns it evident in the nose and taste of wine.
## free.sulfur.dioxide total.sulfur.dioxide
## Min. : 1,00 Min. : 6,00
## 1st Qu.: 7,00 1st Qu.: 22,00
## Median :14,00 Median : 38,00
## Mean :15,87 Mean : 46,47
## 3rd Qu.:21,00 3rd Qu.: 62,00
## Max. :72,00 Max. :289,00
The summary above shows out there must be an outlier on data.
From the boxplot it’s clear to see there are two outliers in the total measurements whose amount of sulfur dioxide goes above \(200 {mg}/{dm}^3\). I will back to this point later during the multivariate analysis.
In the charts above I intended to plot the histogram both for the free form and the total amount of sulfur dioxide. It is clear to observe that the total amount follows the same shape as for the free form, the first showing a higher concentration as expected.
This attribute stands for the amount of salt (sodium chloride) in the wine, measured in \(g/{dm}^3\), i.e. in grams per liters.
From the boxplot we notice there are outliers in the right-skewed chlorides distribution. From the histogram, these outliers are visible when we use a log10 scale for y-axis.
This attribute stand for the amount of sugar remaining after fermentation stops. Accordingly to the dataset documentation, it’s rare to find wines with less than 1 gram/liter (or \(g/{dm}^3\) in the International System of Units - SI). On the other hand, wines with more than 45 grams/liter of sugar are categorized as sweet.
Analyzing the histogram above, even though there are some outliers, none of them is above \(45 g/{dm}^3\), which means that the whole dataset is about dry wines.
There are three attributes in the dataset related to the wine acidity, all of them measured in \(g/{dm}^3\):
* Fixed acidity: normally refers to tartaric acid which are involved with wine and do not evaporate readily.
* Volatile acidity: the amount of acetic acid in wine, which at too high levels can be unpleasant, tasting like a vinegar.
* Citric acid: usually found in small quantities, citric acid can add some freshness and flavor to wines.
From these charts, and considering the description for each kind of acidity, we can draw as hypothesis that the citric acid outliers relates to some wines whose main characteristic is its freshness or its fruit flavor. The fact that both citric acid and volatile acidity log distribution have the same shape called my attention, at first. However, as we shall see in the further analysis, it does not mean any relationship among these attributes by now. The fixed acidity, on the other hand, presents values in a higher scale as well as more outliers.
In this Section I explored the Red Wine dataset built from a scientific research about whose objective was to model wine preferences by data mining from physicochemical properties [1].
This first analysis, which is called univariate since it considers only one attribute at a time, is essential because it allows both a general understanding of the subject to be explored and the meaning of each available attribute in the dataset.
At this point, we are already able to select some important attributes in order to check how they relate to each other:
The quality attribute will be our targeted variable throughout the whole analysis. As a premise, this analysis objective is to find out which attributes are related to wines of better quality, ie, those assigned with a quality level \(\geq 7\).
The description obtained for each attribute, such as alcohol, pH, and density, for example, will allow us to analyze them jointly with the quality attribute in the next section, in order to seek for correlation.
Many attributes have presented outliers in some level. We could define a strategy of outlier removal or even though a deeper analysis on them, trying to understand why these values are distant from the mean distribution. In this analysis, though, we will only describe them.
In this section, I will start exploring the relationships among all variables in our dataset, directing the analysis to find out which attributes are mainly related to our target variable: the quality score.
From the correlation matrix we notice the only relevant attributes to quality are (i) the percentage of alcohol and (ii) the volatile acidity. There are some relevant correlation among other variables as well, and even though they don’t contribute explicitly to the target variable, it will be insteresting to take a look on them trying to find out some new information or insight.
In order to select which of them we will work with, we should analyze the matrix above considering the previous univariate analysis for each attribute. In this case, we can draw the following assumptions:
The wine quality is positively correlated to alcohol and negatively correlated to the volatile acidity. It does make sence with our previous findings, since the volatile acidity is responsible for an unpleasent taste in the wine.
It is quite obvious that the pH attribute is negatively correlated to acidity, since the lower values in the pH scale means a higher acidity. Since it is a well-known fact, there is no sense to explore these relationships here.
There seems to be some correlation between sulphates and chlorides. However, none of them presents a relevant correlation with quality. Furthermore, from the foregoing description it is possible to infer that both attributes are derived from the same mineral, so that we will not consider these attributes in the next analysis.
It is interesting to observe there are some relationship between density and other attributes such as alcohol, residual sugar, citric acid, and fixed acidity.
It also seems to be evident that the free and total forms of sulfur dioxide will be related to each other. The same stands for the different acidity types. Regarding this last, in order to simplify this analysis, we will now consider only the volatile acidity due to its impact on quality levels.
Based on this, our next task is to create some pairplots relating all these chosen attributes:
The pairplots above summarize most of the information we have gotten until now, as for example the unbalanced distribution of quality – ie, there are more wines scored as 5 or 6 than those below and above these values.
Accordingly to the correlation matrix, alcohol and volatile acidity are the attributes with the most relevant relationship to quality. My first idea was to create a scatter plot with a smooth line to analyze it deeply. However, this type of visualization is not appropriate when one of the features on the chart is categorical or discrete, as is the case of quality. A boxplot, though, brings so much more information, as follow:
As it was expected to, there is some correlation between the alcohol percentage in the wine and its quality. However, as it’s clear to see from the scattered points, different quality levels are assigned when the wine has 11% of alcohol. This leads us to afirm that relevant correlation between alcohol and quality arises only when we are dealing with the bordering values.
Another way to check it out is building a radar chart with the alcohol statistics for each one of the quality levels:
From the chart above it becomes really clear that the wines assigned with a better quality in average have higher alcohol percentages.
The next chart brings the same analysis through boxplots but now considering the volatile acidity versus quality levels.
From the boxplots we can notice there are three main categories of quality we could define: low, equivalent to the scores 3 and 4, middle, equivalent to the scores 5 and 6, and high, equivalent to the socres 7 and 8. We also can notice that the volatile acidity is negatively correlated to quality.
Following I try to explore other relationships among our selected variables. The results are quite similar to those found on the pairplots, besides now with a better resolution and a smooth line that allows us to comprehend how this correlation would be for different numerical values.
Most of the information we could gather from the dataset variables were already obtained in the previous section. By now, we know alcohol is the attribute in our data most correlated to our target variable. In a lower rate, the volatile.acidity is also correlated. The next charts will try to explore the relationships, among others, of these three variables:
Previously we have found that the alcohol percentage correlation to quality levels is higher for bordering values. From the chart above we perceive it also becomes true when we correlate alcohol and the volatile acidity. Our next chart shows the boxplot for the correlation between the volatile acidity and different buckets of alcohol percentage, accordingly to each quality level.
In this section I intend to build a linear regression model that could allow us to estimate the wine quality from the two most important attributes we have found in our EDA.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = wines)
##
## ==========================================
## m1 m2
## ------------------------------------------
## (Intercept) 1,875*** 3,095***
## (0,175) (0,184)
## I(alcohol) 0,361*** 0,314***
## (0,017) (0,016)
## volatile.acidity -1,384***
## (0,095)
## ------------------------------------------
## R-squared 0,227 0,317
## adj. R-squared 0,226 0,316
## sigma 0,710 0,668
## F 468,267 370,379
## p 0,000 0,000
## Log-likelihood -1721,057 -1621,814
## Deviance 805,870 711,796
## AIC 3448,114 3251,628
## BIC 3464,245 3273,136
## N 1599 1599
## ==========================================
The resultant model was not as good as I expected to, with a final R-square measure of 32%. Since the objective of this project was limited on EDA, I will not try to engineer any feature in order to reach better results, but instead I will discuss some aspects that may explain the model insuccess:
Data is unbalanced: there are few samples of the higher and lower quality levels. Although it’s not a determinant factor in linear regression, in this case it may have impacted.
Weak correlation: the correlation among the attributes were weak, ie. none of them was greater than 50%.
Dependent attributes: for a better modelling, it would be better to target exactly what we were looking for: good wines, in the case. In this analysis we have considered the whole spectrum for quality, with the exception of some analyzes in which we sub-sampled the set for quality levels greater than or equal to 7, but whose findings were not relevant.
Subjectivity: this is the most interesting aspect to me. Unlike a diamond exploratory data analysis, for example, whose measures are limited to physicochemical factors, the main attribute of our dataset (quality) is totally subjective. In [3], for example, the influence of the human subjectivity on wine tasting is criticized. In addition, in our dataset it was not informed how many distinct professionals evaluated the same wine, nor the divergences they may have had, as well as other important aspects that should be considered.
This chart was suggested after my first project review and shows up the main result of this EDA. The alcohol percentage and the volatile acidity are the dataset attributes most related to the quality of red wines. In this chart, the alcohol percentage was plot through buckets allowing to perceive that low quality wines don’t have more than 12% of alcohol, while high quality wines don’t have less than 9% of alcohol. On the other hand, the volatile acidity is the one responsible for the vinegar unpleasant taste and one of the main characteristics of a bad wine. From this chart this negative correlation among volatile acidity and quality is also clear visible. For last, I have added the title and the units for each variable, turning this chart ready to be published.
Through this boxplot I was able to identify the second most important attribute in this dataset in relation to the quality score. Unlike alcohol, the volatile acidity is negatively correlated to the wine quality, which is easy to perceive through the mean calculated for each quality level and represented by an ‘X’.
I confess the radarchart was the only visualization I found harder to build with R than with python. Even so, I spent a long time trying to find the appropriate package and finding out how to fill its parameters. The result, however, was exactly what I expected: a simple chart but one with the most evident relationships I found out in this dataset.
Regarding the purpose of this analysis, some reflections and discussions were included closer to each chart during the EDA process, in order to facilitate their comprehension. A final word on this, though, is that the EDA allowed not only to understand the dataset but also to identify its main characteristics, and two variables that are somehow correlated to the target variable, in this case the quality of the red wines.
Some questions aroused from this EDA that may be explored on future researches:
What is the winery of each evaluated wine? The reason to include such information is to correlate the wine density with the water density and pH of the main hydrographic basins of the regions where the wines are produced, also associating these attributes with the “body” perception.
Using this same dataset, it would be interesting to improve the linear model by changing the approach to select the training data. From EDA we know the attributes we have chosen presen more correlation on the bordering values. In this case, it seems reasonably to model our linear regression taking into account this ranges, by defining some operation thresholds for the model, for example.
Also, it will be interesting to apply machine learning on this dataset, since the EDA is already done.
Regarding the technical aspects of EDA, I’m pretty sure R is now one more amazing tool I will put on my data scientist toolkit. Before I got enrolled in this course, and before putting my hands on dirty with this project, I thought someone should choose among python or R. Now I see both languages are plenty of resources and both can be used complementarily.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at http://dx.doi.org/10.1016/j.dss.2009.05.016
[2] http://scitechconnect.elsevier.com/wp-content/uploads/2014/06/main-38.pdf
[3] https://www.theguardian.com/lifeandstyle/2013/jun/23/wine-tasting-junk-science-analysis